Abstract: Sparsity of representations of signals has been 
shown to be a key concept of fundamental importance in fields 
such as blind source separation, compression, sampling and signal 
analysis. The aim of this paper is to compare several commonly- 
used sparsity measures based on intuitive attributes. Intuitively, 
a sparse representation is one in which a small number of 
coefficients contain a large proportion of the energy. In this paper 
six properties are discussed: (Robin Hood, Scaling, Rising Tide, 
Cloning, Bill Gates and Babies), each of which a sparsity measure 
should have. The main contributions of this paper are the proofs 
and the associated summary table which classify commonly-used 
sparsity measures based on whether or not they satisfy these six 
propositions. Only one of these measures satisfies all six: the Gini 
Index. 



I. Introduction 

Whether with sparsity constraints or with sparsity assumptions,
the concept of sparsity is readily used in diverse areas
such as oceanic engineering [1], antennas and propagation [2],
face recognition [3], image processing [4], [5] and medical
imaging [6]. Sparsity has also played a central role in the
success of many machine learning algorithms and techniques
such as matrix factorization [7], signal recovery/extraction
[8], denoising [9], [10], compressed sensing [11], dictionary
learning [12], signal representation [13], [14], support vector
machines [15], sampling theory [16], [17] and source
separation/localization [18], [19]. For example, one method of source
separation is to transform the signal to a domain in which it is
sparse (e.g. time-frequency or wavelet) where the separation
can be performed by a partition of the transformed signal space
due to the sparsity of the representation [20], [21]. There has
also been research in the uniqueness of sparse solutions in
overcomplete representations [22], [23].

There are many measures of sparsity. Intuitively, a sparse 
representation is one in which a small number of coefficients 
contain a large proportion of the energy. This interpretation 
leads to further possible alternative measures. Indeed, there are 
dozens of measures of sparsity used in the literature. Which 
of the sparsity measures is the best? In this paper we suggest 
six desirable characteristics of measures of sparsity and use 
them to compare fifteen popular sparsity measures. 

Considering the nebulous definition of sparsity we begin by 
examining how a sparsity measure should behave in certain 
scenarios. In Sec. II we define six such scenarios and formalize
these scenarios in six mathematical criteria that capture this
desirable behavior. We prove two theorems showing that
satisfaction of some combinations of criteria results in
automatic compliance with a different criterion. In Sec. III we
introduce the most commonly-used sparsity measures in the
literature. We elaborate on one of these measures, the Gini 
Index, as it has many desirable characteristics including the 
ability to measure the sparsity of a distribution. We also show 
graphically how some measures treat components of different 
magnitude. In Sec. IV we present the main result of this work, 
namely, the comparison of the fifteen commonly-used sparsity 
measures using the six criteria. We show that the only measure 
to satisfy all six is the Gini Index. Proofs of the table are 
attached in Appendices A and B. A preliminary report on these 
results (without proofs) appeared in [24]. We then compare the 
fifteen measures graphically on data drawn from two sets of 
parameterized distributions. We select distributions for which 
we can control the 'sparsity'. This allows us to visualize the 
behavior of the sparsity measures in view of the sparse criteria. 
In Sec. V we present some conclusions. The main conclusion 
is that from the fifteen measures, only the Gini Index satisfies 
all six criteria, and, as such, we encourage its use and study. 

II. The Six Criteria 

The following are six desirable attributes of a measure 
of sparsity. The first four, D1 through D4, were originally 
applied in a financial setting to measure the inequity of wealth 
distribution in [25]. The last two, P1 and P2, were proposed in 
[26]. Distribution of wealth can be used interchangeably with 
distribution of energy of coefficients and where convenient 
in this paper, we will keep the financial interpretation in the 
explanations. Inequity of distribution is the same as sparsity. 
An equitable distribution is one with all coefficients having 
the same amount of energy, the least sparse distribution. 

D1 Robin Hood - Robin Hood decreases sparsity (Dalton's 
1st Law). Stealing from the rich and giving to the poor 
decreases the inequity of wealth distribution (assuming 
we do not make the rich poor and the poor rich). This 
comes directly from the definition of a sparse distribution 
being one for which most of the energy is contained in 
only a few of the coefficients. 

D2 Scaling - Sparsity is scale invariant (Dalton's modified 
2nd Law [27]). Multiplying wealth by a constant factor 
does not alter the effective wealth distribution. This 
means that relative wealth is important, not absolute 
wealth. Making everyone ten times more wealthy does 
not affect the effective distribution of wealth. The rich 
are still just as rich and the poor are still just as poor. 

D3 Rising Tide - Adding a constant to each coefficient 
decreases sparsity (Dalton's 3rd Law). Give everyone a 
trillion dollars and the small differences in overall wealth 
are then negligible so everyone will have effectively 
the same wealth. This is intuitive as adding a constant 
energy to each coefficient reduces the relative difference 
of energy between large and small coefficients. This law 
assumes that the original distribution contains at least two 
individuals with different wealth. If all individuals have 
identical wealth, then by D2 there should be no change 
to the sparsity for multiplicative or additive constants. 

D4 Cloning - Sparsity is invariant under cloning (Dalton's 4th 
Law). If there is a twin population with identical wealth 
distribution, the sparsity of wealth in one population is 
the same for the combination of the two. 

P1 Bill Gates - Bill Gates increases sparsity. As one individual 
becomes infinitely wealthy, the wealth distribution 
becomes as sparse as possible. 

P2 Babies - Babies increase sparsity. In populations with 
non-zero total wealth, adding individuals with zero wealth 
to a population increases the sparseness of the distribution 
of wealth. 

These criteria give rise to the sparsest distribution being one 
with one individual owning all the wealth and the least sparse 
being one with everyone having equal wealth. 

Dalton [25] proposed that multiplication by a constant 
should decrease inequality. This was revised to the more 
desirable property of scale invariance. Dalton's fourth 
principle, D4, is somewhat controversial. However, if we have a 
distribution from which we draw coefficients and measure the 
sparsity of the coefficients which we have drawn, as we draw 
more and more coefficients we would expect our measure of 
sparsity to converge. D4 captures this concept. 

'Mathematically this {D4} requires that the measure 
of inequality of the population should be a function 
of the sample distribution function of the population. 
Most common measures of inequality satisfy this last 
principle.' [27] 

Interestingly, most measures of sparsity do not satisfy this 
principle, as we shall see. 

We define a sparsity measure S as a function with the 
following mapping 

$$S : \bigcup_{n \geq 1} \mathbb{C}^n \to \mathbb{R} \qquad (1)$$

where $n \in \mathbb{N}$ is the number of coefficients. Thus S maps 
complex vectors to a real number. 

There are two crucial, core, underlying attributes which our 
sparsity measures must satisfy. As all measures satisfy these 
two conditions trivially we will not comment on them further 
except to define them. 

A1 $S(c) = S(\Pi c)$ where $\Pi$ denotes permutation, that is, the 
sparsity of any permutation of the coefficients is the same. 
This means that the ordering of the coefficients is not 
important. 

A2 The sparsity of the coefficients is calculated using the 
magnitudes of the coefficients. This means we can assume 
we are operating in the positive orthant, without loss of 
generality. 

By A2 we can assume we are operating in the positive 
orthant, and as such we can rewrite (1) as 

$$S : \bigcup_{n \geq 1} \mathbb{R}_+^n \to \mathbb{R} \qquad (2)$$

which is more consistent with the wealth interpretation. 

We will use the convention that $S(c)$ increases with increasing 
sparsity, where $c = [\, c_1 \; c_2 \; \cdots \,]$ are the coefficient 
strengths. Given vectors 

$$c = [\, c_1 \; c_2 \; \cdots \; c_N \,]$$
$$d = [\, d_1 \; d_2 \; \cdots \; d_M \,]$$

we define concatenation, which we use $\|$ to denote, as 

$$c \| d = [\, c_1 \; c_2 \; \cdots \; c_N \; d_1 \; d_2 \; \cdots \; d_M \,].$$

We also define the addition of a constant to a vector 
as the addition of that constant to each element of the vector, 
that is, for $\alpha \in \mathbb{R}$, 

$$c + \alpha = [\, c_1 + \alpha \;\; c_2 + \alpha \;\; \cdots \;\; c_N + \alpha \,].$$

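The six sparse criteria can be formally defined as follows: 

D1 Robin Hood: 
$S([\, c_1 \; \cdots \; c_i - \alpha \; \cdots \; c_j + \alpha \; \cdots \,]) < S(c)$ for all 
$\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$. 
D2 Scaling: 
$S(\alpha c) = S(c)$, $\forall \alpha \in \mathbb{R}$, $\alpha > 0$. 
D3 Rising Tide: 
$S(\alpha + c) < S(c)$, $\alpha \in \mathbb{R}$, $\alpha > 0$ (We exclude the case 
$c_1 = c_2 = c_3 = \cdots = c_i = \cdots \; \forall i$ as this is equivalent to 
scaling.). 
D4 Cloning: 
$S(c) = S(c \| c) = S(c \| c \| c) = S(c \| c \| \cdots \| c)$. 
P1 Bill Gates: 
$\forall i \; \exists \beta = \beta_i > 0$, such that $\forall \alpha > 0$: 
$S([\, c_1 \; \cdots \; c_i + \beta + \alpha \; \cdots \,]) > S([\, c_1 \; \cdots \; c_i + \beta \; \cdots \,])$. 
P2 Babies: 
$S(c \| 0) > S(c)$. 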
As stated above, when proving Rising Tide we exclude the 
scenario where all coefficients are equal. In this case, adding 
a constant is actually a form of scaling. Another interpretation 
is that the case with all coefficients equal is, in fact, the 
minimally sparse scenario and hence adding a constant cannot 
decrease the sparsity. 
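
The six operations behind the criteria can be sketched as small vector transformations. This is only an illustrative sketch in Python: the function names and argument conventions (0-based indices, lists of non-negative numbers) are assumptions made here and do not come from the paper.

```python
# Hypothetical helper names; each function returns a new list and leaves c untouched.

def robin_hood(c, i, j, alpha):
    """D1: take alpha from the larger c[i] and give it to the smaller c[j]."""
    if not (c[i] > c[j] and 0 < alpha < (c[i] - c[j]) / 2):
        raise ValueError("need c[i] > c[j] and 0 < alpha < (c[i] - c[j]) / 2")
    d = list(c)
    d[i] -= alpha
    d[j] += alpha
    return d

def scale(c, alpha):
    """D2: multiply every coefficient by a positive constant."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [alpha * x for x in c]

def rising_tide(c, alpha):
    """D3: add the same positive constant to every coefficient."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [x + alpha for x in c]

def clone(c, copies=2):
    """D4: concatenate `copies` copies of c (c || c || ... || c)."""
    return list(c) * copies

def bill_gates(c, i, alpha):
    """P1: make a single coefficient c[i] larger by alpha."""
    d = list(c)
    d[i] += alpha
    return d

def babies(c, zeros=1):
    """P2: append coefficients with zero energy (c || 0)."""
    return list(c) + [0] * zeros

if __name__ == "__main__":
    c = [0, 1, 3, 5]
    print(robin_hood(c, 3, 1, 1))   # moves 1 from the entry 5 to the entry 1
    print(babies(c, 2))
```

A sparsity measure S satisfying a criterion should then order `S(c)` and `S(transformed c)` as stated in the corresponding definition.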

A. Two Proofs 

As one would surmise there is some overlap between the criteria. 
We present and prove two theorems which demonstrate 
this overlap. Theorem 2.1 states that if a measure satisfies 
both criteria D1 and D2, the sparsity measure also satisfies 
P1 by default. Theorem 2.2 states that a measure satisfying 
D1, D2 and D4 necessarily satisfies P2. 

Theorem 2.1: D1 & D2 $\Rightarrow$ P1, that is, if both D1 and 
D2 are satisfied, P1 is also satisfied. 

Proof: Without loss of generality, we begin with the 
vector c sorted in ascending order 

$$c = [\, c_1 \; c_2 \; \cdots \; c_N \,]$$

with $c_1 \leq c_2 \leq \cdots \leq c_N$. We then perform a series of inverse 
Robin Hood steps to get a vector d, that is, we take from 
smaller coefficients and give to the largest coefficient 

$$d_i = c_i - \Delta c_i \quad \forall i = 1, 2, \ldots, N-1$$
$$d_N = c_N + \sum_{i=1}^{N-1} \Delta c_i$$

with condition $0 < \Delta < 1$. As these are inverse Robin Hood steps 
(inverse D1), they increase sparsity and result in the vector 

$$d = [\, (c_1 - \Delta c_1) \;\; (c_2 - \Delta c_2) \;\; \cdots \;\; (c_{N-1} - \Delta c_{N-1}) \;\; (\Delta c_1 + \cdots + \Delta c_{N-1} + c_N) \,].$$

Without affecting the sparsity we can then scale (D2) d by 
$\frac{1}{1 - \Delta}$ to get 

$$e = \left[\, c_1 \;\; c_2 \;\; \cdots \;\; c_{N-1} \;\; \tfrac{1}{1 - \Delta} (\Delta c_1 + \Delta c_2 + \cdots + \Delta c_{N-1} + c_N) \,\right]$$
$$\phantom{e} = [\, c_1 \;\; c_2 \;\; \cdots \;\; c_{N-1} \;\; \alpha + c_N \,],$$

where 

$$\alpha = \frac{1}{1 - \Delta} (\Delta c_1 + \Delta c_2 + \cdots + \Delta c_{N-1} + \Delta c_N).$$

It is clear that 

$$S(e) = S(d) > S(c),$$

which is equivalent to P1 with the given $\alpha$ and $\beta = 0$. If 
we wish to operate on $c_i$ (instead of $c_N$ as above), $\beta$ can be 
chosen sufficiently large to make the desired coefficient the 
largest, that is, we set 

$$\beta > c_N - c_i.$$

■ 
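
The construction in the proof of Theorem 2.1 can be traced numerically. The following is a minimal sketch using exact rational arithmetic; the function name and the example values are assumptions chosen for illustration, and the code only reproduces the vectors d and e, it does not evaluate any sparsity measure.

```python
from fractions import Fraction

def theorem_2_1_construction(c, delta):
    """Build d (inverse Robin Hood steps) and e (rescaled d) from the proof.

    c     : non-negative coefficients
    delta : the fraction Delta with 0 < Delta < 1 taken from each smaller entry
    """
    c = sorted(Fraction(x) for x in c)          # ascending order, as in the proof
    delta = Fraction(delta)
    if not 0 < delta < 1:
        raise ValueError("need 0 < delta < 1")
    # d_i = c_i - Delta*c_i for i < N; the largest entry receives everything taken.
    d = [x - delta * x for x in c[:-1]]
    d.append(c[-1] + delta * sum(c[:-1]))
    # Scaling by 1/(1 - Delta) restores the first N-1 entries to c_1 .. c_{N-1}.
    e = [x / (1 - delta) for x in d]
    # alpha is the net amount by which the largest entry has grown.
    alpha = delta / (1 - delta) * sum(c)
    return d, e, alpha

if __name__ == "__main__":
    d, e, alpha = theorem_2_1_construction([1, 3, 5], Fraction(1, 4))
    print("d =", d)
    print("e =", e, " alpha =", alpha)
```

With c = [1, 3, 5] and Δ = 1/4, the rescaled vector agrees with c in every entry except the largest, which has grown by α; this is exactly the Bill Gates operation with β = 0.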

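Theorem 2.2: D1 & D2 & D4 $\Rightarrow$ P2, that is, if D1, 
D2 and D4 are satisfied, P2 is also satisfied. 

Proof: We begin with vector c 

$$c = [\, c_1 \; c_2 \; \cdots \; c_N \,].$$

We then clone (D4) this $N + 1$ times to get 

$$C = [\, \underbrace{c \;\; c \;\; \cdots \;\; c}_{N+1} \,].$$

We then take one of the c from C, which we shall refer to as 
$\hat{c}$, and by a series of inverse Robin Hood operations (D1) we 
distribute this $\hat{c}$ in accordance with the size of each element to 
form a new vector D. That is to say, each $c_i$ of each c (excluding 
$\hat{c}$) becomes $c_i + \frac{c_i}{N}$ by N consecutive inverse Robin Hood 
operations which increase sparsity. The result is 

$$D = \left[\, \underbrace{c + \tfrac{c}{N} \;\; c + \tfrac{c}{N} \;\; \cdots \;\; c + \tfrac{c}{N}}_{N} \;\; \underbrace{0 \;\; 0 \;\; \cdots \;\; 0}_{N} \,\right].$$

We can then scale (D2) D by a factor of $\frac{N}{N+1}$ without 
affecting the sparsity to get 

$$E = [\, \underbrace{c \;\; c \;\; \cdots \;\; c}_{N} \;\; \underbrace{0 \;\; 0 \;\; \cdots \;\; 0}_{N} \,]$$

which by cloning (D4) we know is equivalent to 

$$F = [\, c \;\; 0 \,].$$

In summation, we have shown that 

$$S(c) = S(C) < S(D) = S(E) = S(F), \qquad (3)$$

that is, 

$$S(c) < S(c \| 0), \qquad (4)$$

which is also known as P2. ■ 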
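
The vectors used in the proof of Theorem 2.2 can also be built explicitly. This is a hedged sketch with exact fractions; the function name is an assumption, and the comparison with N clones of c||0 is done up to permutation, which is allowed by the permutation-invariance attribute A1.

```python
from fractions import Fraction

def theorem_2_2_construction(c):
    """Return (D, E, F_cloned) for the proof's construction.

    D        : N copies of c + c/N followed by N zeros (one copy given away)
    E        : D scaled by N/(N+1)
    F_cloned : N clones of the vector c || 0, for comparison with E
    """
    c = [Fraction(x) for x in c]
    n = len(c)
    # One of the N+1 copies is redistributed: every c_i in the remaining
    # N copies gains c_i / N, and the donor copy becomes N zeros.
    D = [x + x / n for x in c] * n + [Fraction(0)] * n
    # Scaling by N/(N+1) turns c_i + c_i/N back into c_i.
    E = [x * Fraction(n, n + 1) for x in D]
    F_cloned = (c + [Fraction(0)]) * n
    return D, E, F_cloned

if __name__ == "__main__":
    D, E, F_cloned = theorem_2_2_construction([1, 3, 5])
    print("D =", D)
    print("E =", E)
    print("E is a permutation of N clones of c||0:", sorted(E) == sorted(F_cloned))
```

The total energy is conserved between the N+1 clones and D, which is why the steps count as (inverse) Robin Hood operations rather than a change of total wealth.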



III. The Measures of Sparsity 

In this section we discuss a number of popular sparsity mea- 
sures. These measures are used to calculate a number which 
describes the sparsity of a vector $c = [\, c_1 \; c_2 \; \cdots \; c_N \,]$. 
The measures' monikers and their definitions are listed in 
Table I. Some measures in Table I have been manipulated 
(in general negated) to ensure that an increase in sparsity 
results in a (positive) increase in the sparse measure. 

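TABLE I 

Commonly used sparsity measures modified to become more 
positive for increasing sparsity. 

Measure | Definition 
$\ell^0$ | $\#\{j : c_j = 0\}$ 
$\ell^0_\epsilon$ | $\#\{j : c_j \leq \epsilon\}$ 
$-\ell^1$ | $-\sum_j c_j$ 
$-\ell^p$ | $-\left(\sum_j c_j^p\right)^{1/p}$, $0 < p < 1$ 
$\ell^2/\ell^1$ | $\sqrt{\sum_j c_j^2} \,\big/ \sum_j c_j$ 
$-\tanh_{a,b}$ | $-\sum_j \tanh\left((a c_j)^b\right)$ 
$-\log$ | $-\sum_j \log\left(1 + c_j^2\right)$ 
$\kappa_4$ | $\sum_j c_j^4 \,\big/ \left(\sum_j c_j^2\right)^2$ 
$u_\theta$ | $1 - \min_{i = 1, 2, \ldots, N - \lceil \theta N \rceil + 1} \frac{c_{(i + \lceil \theta N \rceil - 1)} - c_{(i)}}{c_{(N)} - c_{(1)}}$ s.t. $\lceil \theta N \rceil \neq N$, for ordered data, $c_{(1)} \leq c_{(2)} \leq \cdots \leq c_{(N)}$ 
$-\ell^p_-$ | $-\sum_{j,\, c_j \neq 0} c_j^p$, $p < 0$ 
$H_G$ | $-\sum_j \log c_j^2$ 
$H_S$ | $-\sum_j \tilde{c}_j \log \tilde{c}_j^2$ where $\tilde{c}_j = c_j^2 / \|c\|_2^2$ 
$H_S'$ | $-\sum_j c_j \log c_j^2$ 
Hoyer | $\left(\sqrt{N} - \frac{\sum_j c_j}{\sqrt{\sum_j c_j^2}}\right)\left(\sqrt{N} - 1\right)^{-1}$ 
Gini | $1 - 2 \sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1} \left(\frac{N - k + \frac{1}{2}}{N}\right)$ for ordered data, $c_{(1)} \leq c_{(2)} \leq \cdots \leq c_{(N)}$ 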



In [28] the $\ell^0$, $\ell^0_\epsilon$, $\ell^p$, $\tanh_{a,b}$, log and $\kappa_4$ were compared. 
The most commonly used and studied sparsity measures are 
the $\ell^p$ norm-like measures, 

$$\|c\|_p = \left(\sum_j c_j^p\right)^{1/p} \quad \text{for } 0 < p \leq 1.$$

The $\ell^0$ measure simply calculates the number of non-zero 
coefficients in c, 

$$\|c\|_0 = \#\{c_j \neq 0, \; j = 1, \ldots, N\}.$$

The $\ell^0$ measure is the traditional sparsity measure in many 
mathematical settings. However, it is unsuited to most practical 
scenarios, as an infinitesimally small value is treated the same 
as a large value. This means that the derivative of the measure 
contains no information and as such the $\ell^0$ cannot be used in 
optimization problems. Exhaustive search is the only method 
of finding the sparsest solution when using the $\ell^0$ measure and 
approximations are usually used [29], [30]. The presence of 
noise makes the $\ell^0$ measure completely inappropriate. In noisy 
settings, the measure is sometimes modified to $\ell^0_\epsilon$, where we 
are interested in the number of coefficients $c_j$ that are greater 
than a threshold $\epsilon$ [31]. Clearly, the value of $\epsilon$ is crucial for 
$\ell^0_\epsilon$ to be meaningful. This is undesirable. As optimization using 
$\ell^0$ is difficult because the gradient yields no information, $\ell^p$ with 
$0 < p < 1$ is often used in its place [32]. The $\ell^1$ measure, 
that is, $\ell^p$ with $p = 1$, approximates the $\ell^0$ measure and is 
easily calculated. Under this measure, large coefficients are 
considered more important than small coefficients, unlike the 
$\ell^0$ measure. In most settings, the $\ell^1$ solution can be used to 
find the support of the $\ell^0$ solution [33]. The $\ell^1$ measure is 
used in many optimization problems, as linear programming 
offers a fast, computationally efficient solution [34], [35]. 
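
The difference between counting-type and norm-type measures described above can be seen in a few lines. The following sketch uses the unnegated forms from the running text (a count of non-zero entries, a count of entries above a threshold, and the ℓp norm-like sum); the function names and the example threshold are assumptions for illustration.

```python
def l0(c):
    """Number of non-zero coefficients."""
    return sum(1 for x in c if x != 0)

def l0_eps(c, eps):
    """Number of coefficients whose magnitude exceeds the threshold eps."""
    return sum(1 for x in c if abs(x) > eps)

def lp(c, p):
    """l^p norm-like measure (sum_j |c_j|^p)^(1/p), intended for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    return sum(abs(x) ** p for x in c) ** (1.0 / p)

if __name__ == "__main__":
    tiny = [0, 1e-9, 5]
    large = [0, 3, 5]
    # l0 cannot tell an infinitesimal coefficient from a large one.
    print(l0(tiny), l0(large))
    # The thresholded count depends entirely on the chosen eps.
    print(l0_eps(tiny, 1e-6), l0_eps(tiny, 1e-12))
    # With equal l1 norm, the concentrated vector has the smaller l^p value.
    print(lp([0, 0, 4], 0.5), lp([1, 1, 2], 0.5))
```

The last line shows why a small ℓp value (equivalently a large negated value, as in Table I) is associated with a sparser vector when the total ℓ1 energy is held fixed.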

In [28] several alternative measures of sparsity are noted 
which approximate the $\ell^0$ measure but emphasize different 
properties. $\tanh_{a,b}$ is sometimes used in place of $\ell^p$, $0 < p < 1$, 
as it is limited to the range (0, 1) and better models $\ell^0$ and 
$\ell^0_\epsilon$ in this respect. A representation is more sparse if it has one 
large component, rather than dividing up the large component 
into two smaller ones. $\tanh_{a,b}$ and $\ell^p$ preserve this. In [26] it 
is shown that the log measure enforces sparsity outside some 
range, but for distributions with low energy coefficients the 
opposite is achieved by effectively spreading the energy of 
the small components. $\kappa_4$ is the kurtosis which measures the 
peakedness of a distribution [36]. $u_\theta$ measures the smallest 
range which contains a certain percentage of the data. This 
is achieved by sorting the data and determining the minimum 
difference between the largest and smallest sample in a range 
containing the specified percentage ($\theta$) of data points as a 
fraction of the total range of the data. The reason that a 
continuous parameter $\theta$ is used in the model is to maintain 
compatibility with pre-existing literature. 

For measuring 'diversity', [37], [38] use some different 
measures. Three of these are entropy measures: the Shannon 
entropy diversity measure $H_S$, a modified version of the Shannon 
entropy diversity measure $H_S'$ and the Gaussian entropy 
diversity measure $H_G$. They also extend the $\ell^p$ measure to 
negative exponents, that is, $-1 < p < 0$. We call this measure 
$\ell^p_-$ to avoid confusion. 

Some of the measures can be normalized to satisfy more of 
the constraints, although in general for the measures, forcing 
satisfaction of one constraint means breaking another. The 
exception to this is the Hoyer measure [39] which is a 
normalized version of the $\ell^2/\ell^1$ measure as is obvious from its 
definition, $\left(\sqrt{N} - \frac{\ell^1}{\ell^2}\right)\left(\sqrt{N} - 1\right)^{-1}$. In Fig. 1 we can get an 
insight into how component magnitude affects certain measures. 
In general, the smaller the magnitude the less it impinges on 
the sparsity of the measure. We can see how many of the 
measures approximate the $\ell^0$ measure but as they are not flat 
like the $\ell^0$ measure, they have a gradient that can be used in 
optimization problems. The $\ell^0_\epsilon$, $\ell^1$, tanh, log, $\ell^p$ ($0 < p < 1$), 
$\ell^p_-$ measures all prefer components to be zero or near zero. 
Oddly, the Shannon entropy based measures $H_S$ and $H_S'$ 
prefer components to be at a non-zero value less than 1. 
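
The Hoyer normalization mentioned above is easy to evaluate directly. This is a minimal sketch of the expression $(\sqrt{N} - \ell^1/\ell^2)(\sqrt{N} - 1)^{-1}$; the function name is an assumption, and the code assumes at least two coefficients and a non-zero vector.

```python
import math

def hoyer(c):
    """Hoyer measure: (sqrt(N) - l1/l2) / (sqrt(N) - 1)."""
    n = len(c)
    l1 = sum(abs(x) for x in c)
    l2 = math.sqrt(sum(x * x for x in c))
    if n < 2 or l2 == 0:
        raise ValueError("need at least two coefficients and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

if __name__ == "__main__":
    print(hoyer([1, 1, 1, 1]))   # all coefficients equal: least sparse case
    print(hoyer([0, 0, 0, 1]))   # one non-zero coefficient: most sparse case
    print(hoyer([0, 1, 3, 5]), hoyer([0, 2, 3, 4]))
```

The normalization by $\sqrt{N}$ is what pins the two extreme cases to 0 and 1 for any fixed number of coefficients.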

A. The Gini Index 

Having perused the measures thus far, some desirable 
aspects of a sparsity measure emerge. Like p- and Hoyer, 



a measure should be some kind of weighted sum of the 
coefficients. This means that unlike $\ell^0$ when a coefficient 
changes slightly we have a weighted effect on the correspond- 
ing change in the value of the sparsity measure based on 
how 'important' that particular coefficient is to the overall 
sparsity. Large coefficients should have a smaller weight than 
the small coefficients so that they do not overwhelm them to 
the point that smaller coefficients have a negligible (or no) 
effect on the measure of sparsity. If even one of the smaller 
coefficients is changed, that change should be reflected by a 
change in the value of the sparsity measure. A weighted sum 
achieves this. In other words, we have a gradient which we can 
use in optimization problems. Another important aspect of a 
sparsity measure is normalization. A set of coefficients should 
not be rated more or less sparse simply because it has more 
coefficients than another set, nor should it be deemed more or 
less sparse simply due to having louder or quieter coefficients. 
In short, there should be two forms of normalization. Firstly, 
the measure of sparsity should be dependent on the relative 
values of coefficients as a fraction of the total value. Secondly, 
the measure of sparsity should be independent of the number 
of coefficients so that sets of different size can be compared. 
Lastly, it would be useful if the measure was 0 for the least 
sparse case and 1 for the most sparse case. All these qualities 
are embodied by the Gini Index, which we now define. 

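Given a vector, $c = [\, c_1 \; c_2 \; c_3 \; \cdots \,]$, we order from 
smallest to largest, $c_{(1)} \leq c_{(2)} \leq \cdots \leq c_{(N)}$, where 
$(1), (2), \ldots, (N)$ are the new indices after the sorting operation. 
The Gini Index is given by 

$$S(c) = 1 - 2 \sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1} \left( \frac{N - k + \frac{1}{2}}{N} \right).$$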
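
The definition translates directly into code. The sketch below is an illustrative implementation of the sorted-coefficient formula, with the function name and the use of magnitudes (per attribute A2) as stated assumptions.

```python
def gini(c):
    """Gini Index of the coefficient magnitudes.

    S(c) = 1 - 2 * sum_k (c_(k) / ||c||_1) * ((N - k + 1/2) / N),
    with c_(1) <= ... <= c_(N) the magnitudes sorted in ascending order.
    """
    mags = sorted(abs(x) for x in c)
    n = len(mags)
    total = sum(mags)
    if n == 0 or total == 0:
        raise ValueError("Gini Index is undefined for an empty or all-zero vector")
    weighted = sum(v * (n - k + 0.5) for k, v in enumerate(mags, start=1))
    return 1.0 - 2.0 * weighted / (total * n)

if __name__ == "__main__":
    print(gini([1, 1, 1, 1]))   # equal coefficients give 0
    print(gini([0, 0, 0, 1]))   # a single non-zero coefficient gives 1 - 1/N
    print(gini([0, 1, 3, 5]))
```

For a single non-zero coefficient the value is 1 - 1/N, so the upper end of the range is approached as the number of coefficients grows.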

The Gini Index also has an interesting graphical interpretation 
which we see in Fig. 2. If percentage of coefficients 
versus percentage of total coefficient value is plotted for the 
sorted coefficients we can define the Gini Index as twice the 
area between this line and the 45° line. The 45° line represents 
the least sparse distribution, that with all the coefficients being 
equal. 
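
The graphical interpretation can be checked numerically by building the piecewise-linear curve of cumulative coefficient share and measuring the area it encloses with the 45° line. This is a sketch under the assumption that the curve is linearly interpolated between the sorted coefficients; the function names are hypothetical.

```python
def lorenz_points(c):
    """Points (fraction of coefficients, fraction of total value), sorted ascending."""
    mags = sorted(abs(x) for x in c)
    n = len(mags)
    total = sum(mags)
    if n == 0 or total == 0:
        raise ValueError("need a non-empty vector with non-zero total")
    points = [(0.0, 0.0)]
    running = 0.0
    for k, v in enumerate(mags, start=1):
        running += v
        points.append((k / n, running / total))
    return points

def gini_from_area(c):
    """Twice the area between the 45-degree line and the cumulative-share curve."""
    pts = lorenz_points(c)
    # Trapezoidal area under the piecewise-linear curve.
    area_under = sum((x1 - x0) * (y0 + y1) / 2.0
                     for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    # The area under the 45-degree line on [0, 1] is 1/2.
    return 2.0 * (0.5 - area_under)

if __name__ == "__main__":
    print(gini_from_area([0, 1]))
    print(gini_from_area([1, 1, 2, 3, 10]))
```

The two example vectors are the ones plotted in Fig. 2.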

If we have a distribution from which we draw coefficients 
and measure the sparsity of the coefficients which we have 
drawn, as we draw more and more coefficients we would 
expect our measure of sparsity to converge. The Gini Index 
meets these expectations. The Gini Index of a distribution 
with probability density function $f(x)$ (which satisfies $f(x) = 0$, 
$x < 0$) and cumulative distribution function $F(x)$ is given 
by 

$$G = 1 - 2 \int_0^1 \frac{\int_0^x t f(t)\,dt}{\int_0^\infty t f(t)\,dt}\, dF(x).$$

As a side note, the Gini Index was originally proposed in 
economics as a measure of the inequality of wealth [40], [41], 
[25], [27] and is still studied in relation to wealth distribution 
as well as other areas [42], [43], [44], [45]. 'Inequality 
in wealth' in signal processing language is 'efficiency of 
representation' or 'sparsity'. The utility of the Gini Index as a 
measure of sparsity has been demonstrated in [26], [46], [47]. 

Fig. 1. Component contribution to sparsity measure vs component amplitude. 

TABLE II 

Most common counter-example for a given property with measure of sparsity and desired outcome with sparsity measure $S(\cdot)$. 

Property | Most common counter-example | Desired outcome 
D1 | [0,1,3,5] vs [0,2,3,4] | S([0,1,3,5]) > S([0,2,3,4]) 
D2 | [0,1,3,5] vs [0,2,6,10] | S([0,1,3,5]) = S([0,2,6,10]) 
D3 | [1,3,5] vs [1.5,3.5,5.5] | S([1,3,5]) > S([1.5,3.5,5.5]) 
D4 | [0,1,3,5] vs [0,0,1,1,3,5] | S([0,1,3,5]) = S([0,0,1,1,3,5]) 
P1 | [0,1,3,5] vs [0,1,3,20] | S([0,1,3,5]) < S([0,1,3,20]) 
P2 | [0,1,3,5] vs [0,0,0,1,3,5] | S([0,1,3,5]) < S([0,0,0,1,3,5]) 




Fig. 2. Percentage of coefficients versus percentage of total coefficient value 
is plotted for the sorted coefficients for [0 1] (top) and [1 1 2 3 10] 
(bottom). The Gini Index is twice the shaded area. 



IV. Comparison of Sparsity Measures 

In this section we present the main result of the paper, 
the comparison of the measures using the criteria. Many of 
the measures fail for simple test cases which prove non-compliance. 
For example, [0, 1, 3, 5] is more sparse than 
[0, 2, 3, 4] because a Robin Hood operation maps one sequence 
to the other. Six of the measures do not correctly handle this 
case. Others fail on similar examples. Seven of the measures, 
however, satisfy D1. An example for each sparse criterion is 
given in Table II along with the desired outcome when the 
sparsity of the examples is measured with sparsity measure 
$S(\cdot)$. Table III details which of the six sparse criteria hold 
for each of the fifteen measures. The information is based 
on proofs and counter-examples which are contained in their 
entirety in Appendices A and B. There are essentially two 
types of proof, Type A and Type B. Type A is the standard 
form of proof which uses inequalities, an example of which 
is the following: 

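Theorem 4.1: $\ell^2/\ell^1$ satisfies 

$$S([\, c_1 \; \cdots \; c_i - \alpha \; \cdots \; c_j + \alpha \; \cdots \,]) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$. 

Proof: As $\ell^2/\ell^1 = \frac{\sqrt{\sum_j c_j^2}}{\sum_j c_j}$ we can restate the above as 

$$\frac{\sqrt{\sum_{k \neq i,j} c_k^2 + (c_i - \alpha)^2 + (c_j + \alpha)^2}}{\sum_k c_k + \alpha - \alpha} < \frac{\sqrt{\sum_k c_k^2}}{\sum_k c_k}.$$

This simplifies to 

$$(c_i - \alpha)^2 + (c_j + \alpha)^2 < c_i^2 + c_j^2.$$

Expand this to get 

$$c_i^2 - 2 c_i \alpha + \alpha^2 + c_j^2 + 2 c_j \alpha + \alpha^2 < c_i^2 + c_j^2$$
$$c_j - c_i + \alpha < 0,$$

which we know is true as $0 < \alpha < \frac{c_i - c_j}{2}$. ■ 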



A type B proof on the other hand uses derivatives, for 
example: 

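Theorem 4.2: $-\ell^p$ satisfies 

$$S([\, c_1 \; \cdots \; c_i - \alpha \; \cdots \; c_j + \alpha \; \cdots \,]) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$. 

Proof: 

$$-\ell^p = -\left(\sum_j c_j^p\right)^{1/p}, \quad 0 < p < 1.$$

We wish to show that the following holds true for all $\alpha, c_i, c_j$ 
such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$: 

$$\frac{\partial}{\partial \alpha} \left[ -\left( \sum_{k \neq i,j} c_k^p + (c_i - \alpha)^p + (c_j + \alpha)^p \right)^{1/p} \right] < 0.$$

Expand this to get 

$$-\frac{1}{p} \left( \sum_{k \neq i,j} c_k^p + (c_i - \alpha)^p + (c_j + \alpha)^p \right)^{\frac{1}{p} - 1} \left( -p (c_i - \alpha)^{p-1} + p (c_j + \alpha)^{p-1} \right) < 0.$$

Which holds true if 

$$(c_j + \alpha)^{p-1} - (c_i - \alpha)^{p-1} > 0.$$

As $p - 1 < 0$ we can rewrite the above as 

$$\frac{1}{(c_j + \alpha)^{1-p}} > \frac{1}{(c_i - \alpha)^{1-p}}$$
$$\frac{1}{(c_j + \alpha)} > \frac{1}{(c_i - \alpha)}$$
$$c_i - \alpha > c_j + \alpha$$
$$\frac{c_i - c_j}{2} > \alpha,$$

which is necessarily true as it is one of the constraints upon 
$\alpha$. ■ 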

From Table III we can see that D3 (Rising Tide) is satisfied by 
most measures. This shows that relative size of coefficients is 
of the utmost importance when desiring sparsity. As previously 
mentioned, most measures do not satisfy D4 (Cloning). Each 
of the other criteria is satisfied by a varying number of the 
fifteen measures of sparsity. This demonstrates the variety of 
attributes to which measures of sparsity attach importance. K4 
and the Hoyer measure satisfy most of the criteria. The Gini 
Index alone satisfies all six criteria. 
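
The behaviour summarized above can be spot-checked on concrete vectors. The sketch below evaluates the Gini Index and the Hoyer measure on pairs in the spirit of Table II; note that the cloning pair is built here as the exact concatenation c||c, which is an assumption of this sketch rather than the vector printed in the table, and passing single examples illustrates a criterion without proving it.

```python
import math

def gini(c):
    """Gini Index of the sorted magnitudes."""
    mags = sorted(abs(x) for x in c)
    n, total = len(mags), sum(mags)
    weighted = sum(v * (n - k + 0.5) for k, v in enumerate(mags, start=1))
    return 1.0 - 2.0 * weighted / (total * n)

def hoyer(c):
    """Hoyer measure (sqrt(N) - l1/l2) / (sqrt(N) - 1)."""
    n = len(c)
    l1 = sum(abs(x) for x in c)
    l2 = math.sqrt(sum(x * x for x in c))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def check_examples(S, tol=1e-12):
    """Evaluate one example per criterion; True means the desired outcome holds."""
    c = [0, 1, 3, 5]
    return {
        "D1": S([0, 2, 3, 4]) < S(c) - tol,               # Robin Hood lowers S
        "D2": abs(S([0, 2, 6, 10]) - S(c)) <= tol,        # scaling leaves S unchanged
        "D3": S([1.5, 3.5, 5.5]) < S([1, 3, 5]) - tol,    # rising tide lowers S
        "D4": abs(S(c + c) - S(c)) <= tol,                # cloning leaves S unchanged
        "P1": S([0, 1, 3, 20]) > S(c) + tol,              # Bill Gates raises S
        "P2": S(c + [0, 0]) > S(c) + tol,                 # babies raise S
    }

if __name__ == "__main__":
    print("Gini :", check_examples(gini))
    print("Hoyer:", check_examples(hoyer))
```

On these vectors the Gini Index gives the desired outcome in all six cases, while the Hoyer measure changes under cloning and gives the desired outcome in the other five.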



TABLE III 

Comparison of different sparsity measures using criteria 
defined in Sec. II. 

Measure | D1 | D2 | D3 | D4 | P1 | P2 
$\ell^0$ 

A. Numerical Sparse Analysis 

In this section we present the results of using the fifteen 
sparse measures to measure the sparsity of data drawn from 
a set of parameterized distributions. We select data sets and 
distributions for which we can change the 'sparsity' by altering 
a parameter. By applying the fifteen measures to data drawn 
from these distributions as a function of the parameter, we can 
visualize the criteria. The examples are based on the premise 
that all coefficients being equal is the least sparse scenario 
and all coefficients being zero except one is the most sparse 
scenario. 

In the first experiment we draw a variable number of 
coefficients from a probability distribution and measure their 
sparsity. We expect sets of coefficients from the same distribu- 
tion to have a similar sparsity. As we increase the number of 
coefficients we expect the measure of sparsity to converge. In 
this experiment we examine the sparsity of sets of coefficients 
from a Poisson distribution (Fig. 3) with parameter $\lambda = 5$ 
as a function of set size. From the normalized version of 
the sparsity plot in Fig. 4 we can see that three measures 
converge. They are $\kappa_4$, the Hoyer measure and the Gini Index. 
As this is similar in nature to D4 we expect the Gini Index to 
converge. The convergence of the Hoyer measure is unsurprising 
as this measure almost satisfies D4, especially for large N. 
The results are also normalized for clearer visualization in that 
they are modified so that the sparsity falls between 0 and 1. 

In the second experiment we take coefficients from a 
Bernoulli distribution where coefficients are either 0 with 
probability p or 1 with probability $1 - p$. For this experiment 
the set size remains constant and the probability p varies from 
0 to 1. With a low p most coefficients will be 1 and very 
few zero. The energy distribution of such a set is not sparse 
and accordingly has a low value (see Fig. 5). As p increases 
so should the sparsity measure. We can see this is the case in 
some form for all of the measures except $H_S'$. We note that $\kappa_4$ 
does not rise steadily with increasing p but rises dramatically 
as the set approaches its sparsest. This is of some concern if 
optimizing sparsity using $\kappa_4$ as there is not much indication 
that the distribution is getting more sparse until it is already 
quite sparse. 

Fig. 3. Sample Poisson distribution probability density functions for $\lambda$ = 5, 10, 15, 30. We expect the distributions with a 'narrower' peak (small $\lambda$) to have 
a higher sparsity than those with a 'wider' peak (large $\lambda$). 

Fig. 4. Sparsity of sets of coefficients drawn from a Poisson distribution ($\lambda = 5$) vs the length of the vector of coefficients (No. of samples). The erratically 
ascending measures are $\ell^0$ and $\ell^0_\epsilon$. The measures log, tanh, $H_G$, $H_S'$ and $\ell^p_-$ are grouped in an almost-straight decreasing line. The 
measures are scaled to be between 0 and 1. 

Fig. 5. Sparsity vs p for a Bernoulli distribution with coefficients being 0 with probability p and 1 otherwise. The measures are scaled to fit between a 
sparsity range of 0 to 1. 
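
The late rise of $\kappa_4$ in the Bernoulli experiment can be reproduced with a deterministic stand-in. The sketch below assumes the kurtosis-type form $\kappa_4 = \sum_j c_j^4 / (\sum_j c_j^2)^2$ and replaces random draws with vectors containing an exact number of zeros; both the helper names and this simplification are assumptions made for illustration.

```python
def kappa4(c):
    """Kurtosis-type measure: sum of fourth powers over squared sum of squares."""
    s2 = sum(x ** 2 for x in c)
    s4 = sum(x ** 4 for x in c)
    if s2 == 0:
        raise ValueError("kappa4 is undefined for an all-zero vector")
    return s4 / s2 ** 2

def zero_one_vector(n, zeros):
    """Vector of length n with `zeros` zeros and n - zeros ones."""
    return [0] * zeros + [1] * (n - zeros)

if __name__ == "__main__":
    n = 100
    for zeros in (0, 50, 90, 99):
        # For 0/1 data with m ones, kappa4 reduces to 1/m.
        print(zeros / n, kappa4(zero_one_vector(n, zeros)))
```

For 0/1 coefficients with m ones the measure equals 1/m, so it stays close to its minimum over most of the range of p and only climbs steeply when very few ones remain.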



V. Conclusions 

In this paper we have presented six intuitive attributes of a 
sparsity measure. Having defined these attributes mathemati- 
cally, we then compared commonly-used measures of sparsity. 
The goal of this paper is to provide motivation for selecting a 
particular measure of sparsity. Each measure emphasizes dif- 
ferent combinations of attributes and this should be addressed 
when selecting a sparsity measure for an application. We can 
see from the main contribution of this paper, Table III, and 
the associated proofs in Appendices A and B, that the only 
measure to satisfy all six criteria is the Gini Index. This aligns 
well with [46] in which it is shown that the Gini Index is 
an indicator for when sources are separable, a property which 
itself relies on sparsity. The Hoyer measure [39] comes a close 
second, failing only D4 (invariance under cloning), which is, 
admittedly an arguable criterion for certain applications. For 
applications in which the number of coefficients is fixed both 
the Gini Index and the Hoyer measure satisfy all criteria. 

We have also presented two graphical examples of the 
performance of the measures when quantifying the sparsity 
of a distribution with sparsity controlled. Again, both the Gini 
Index and the Hoyer measure outperform the other measures, 
illustrating their utility. 

Sparsity is used in many applications but with few excep- 
tions it is not studied as a concept in itself. We hope that 
this work will not just encourage the use of the Gini Index 
but encourage users of sparsity to consider in more depth the 
concept of sparsity. 



Appendix 

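We use these measures to calculate a number which 
describes the sparsity of a set of coefficients 
$c = [\, c_1 \; c_2 \; \cdots \; c_N \,]$. 

Note - ignore the trivial cases, for example, D2 with $\alpha = 1$. 

D1 Robin Hood: 
$S([\, c_1 \; \cdots \; c_i - \alpha \; \cdots \; c_j + \alpha \; \cdots \,]) < S(c)$ for all 
$\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$. 
D2 Scaling: 
$S(\alpha c) = S(c)$, $\forall \alpha \in \mathbb{R}$, $\alpha > 0$. 
D3 Rising Tide: 
$S(\alpha + c) < S(c)$, $\alpha \in \mathbb{R}$, $\alpha > 0$ (We exclude the case 
$c_1 = c_2 = c_3 = \cdots = c_i = \cdots \; \forall i$ as this is equivalent to 
scaling.). 
D4 Cloning: 
$S(c) = S(c \| c) = S(c \| c \| c) = S(c \| c \| \cdots \| c)$. 
P1 Bill Gates: 
$\forall i \; \exists \beta = \beta_i > 0$, such that $\forall \alpha > 0$: 
$S([\, c_1 \; \cdots \; c_i + \beta + \alpha \; \cdots \,]) > S([\, c_1 \; \cdots \; c_i + \beta \; \cdots \,])$. 
P2 Babies: 
$S(c \| 0) > S(c)$. 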

A. Counter-Examples 

The most parsimonious method of showing non-compliance 
with the sparse criteria is through the following simple 
counter-examples. As a sample we take the $-\ell^1$ measure 
and D1. D1 states that the $-\ell^1$ measure of [0, 1, 3, 5] should 
be greater than the $-\ell^1$ measure of [0, 2, 3, 4]. Using the 
counter-example we see that 

$$-\ell^1([0, 1, 3, 5]) = -9$$
$$-\ell^1([0, 2, 3, 4]) = -9.$$

As the Robin Hood operation had no effect on the sparsity of 
the vectors as measured by the $-\ell^1$ measure, the measure does 
not satisfy D1. In the case of $-\ell^p_-$ the zeros in the 
counter-examples are omitted. 
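
The arithmetic of this counter-example is short enough to check directly. A minimal sketch, with the function name chosen here for illustration:

```python
def neg_l1(c):
    """Negated l1 measure: minus the sum of the coefficient magnitudes."""
    return -sum(abs(x) for x in c)

if __name__ == "__main__":
    before = [0, 1, 3, 5]
    after = [0, 2, 3, 4]     # Robin Hood step: 1 moved from the entry 5 to the entry 1
    print(neg_l1(before), neg_l1(after))
```

A Robin Hood step conserves the total, so any measure that depends only on the sum of the coefficients cannot register it.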
Counter Example A.l: 

[0,1,3,5] vs [0,2,3,4] 

Counter Example A.l (*): 

[.3,1,2] vs [.31,. 99, 2] 

Counter Example A.2: 

[0,1,3,5] vs [0,2,6,10] 

Counter Example A.3: 

[1,3,5] vs [1.5,3.5,5.5] 

Counter Example A.3 (*): 

[.1,.3,.5] vs [.15, .35, .55] 

Counter Example A.4: 

[0,1,3,5] vs [0,0,1,1,3,5] 

Counter Example A.5: 

[0,1,3,5] vs [0,1,3,20] 

Counter Example A.6: 

[0,1,3,5] vs [0,0,0,1,3,5] 

B. Proofs 

This section contains the proofs that were longer than 
Table IV permitted. The obvious method of proving that 
the measures satisfy the criteria is to plug the formulae for 
the measures into the mathematical definitions of the six 
criteria. Another method used below is to differentiate the 
modified sparse measure with respect to the parameter that 
modifies it and observe the result. For example, if we show 
that $\frac{\partial S(\alpha + c)}{\partial \alpha} < 0$ for $\alpha > 0$ this proves D3 as any change in 
$\alpha$ causes the measure to drop. 

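1) $-\ell^p$ and D1:

Theorem A.1: $-\ell^p$ satisfies

$$S([\,c_1 \ \ldots\ c_i - \alpha \ \ldots\ c_j + \alpha \ \ldots\,]) < S(c)$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof: With $-\ell^p = -\left(\sum_j c_j^p\right)^{1/p}$, we wish to show
that the following holds true for all $\alpha, c_i, c_j$ such that
$c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$:

$$\frac{\partial}{\partial \alpha}\left[-\left(\sum_{k \neq i,j} c_k^p + (c_i - \alpha)^p + (c_j + \alpha)^p\right)^{1/p}\right] < 0,$$

that is,

$$-\frac{1}{p}\left(\sum_{k \neq i,j} c_k^p + (c_i - \alpha)^p + (c_j + \alpha)^p\right)^{\frac{1}{p} - 1}\left(-p(c_i - \alpha)^{p-1} + p(c_j + \alpha)^{p-1}\right) < 0.$$

Which holds true if

$$(c_j + \alpha)^{p-1} - (c_i - \alpha)^{p-1} > 0.$$

As $p - 1 < 0$ we can rewrite the above as

$$\frac{1}{(c_j + \alpha)^{1-p}} > \frac{1}{(c_i - \alpha)^{1-p}}$$

$$\frac{1}{c_j + \alpha} > \frac{1}{c_i - \alpha}$$

$$c_i - \alpha > c_j + \alpha$$

$$\frac{c_i - c_j}{2} > \alpha,$$

which is necessarily true as it is one of the constraints upon
$\alpha$. ■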
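
As a numerical illustration of Theorem A.1, the sketch below applies one Robin Hood step to a small vector and compares the measure before and after for a few exponents. The form $-(\sum_j c_j^p)^{1/p}$ with $0 < p < 1$ is assumed, and `robin_hood` is a hypothetical helper that enforces the constraint on $\alpha$ stated in the theorem.

```python
def neg_lp(c, p):
    """-(sum_j c_j^p)^(1/p), intended for 0 < p < 1 and non-negative c."""
    return -(sum(x ** p for x in c) ** (1.0 / p))

def robin_hood(c, i, j, alpha):
    """Move alpha from c[i] to c[j]; requires c_i > c_j and 0 < alpha < (c_i - c_j)/2."""
    if not (c[i] > c[j] and 0 < alpha < (c[i] - c[j]) / 2):
        raise ValueError("not a valid Robin Hood step")
    d = list(c)
    d[i] -= alpha
    d[j] += alpha
    return d

c = [0.0, 1.0, 3.0, 5.0]
d = robin_hood(c, i=3, j=1, alpha=1.0)  # gives [0, 2, 3, 4]

# Map each exponent to the pair (measure before, measure after).
results = {p: (neg_lp(c, p), neg_lp(d, p)) for p in (0.25, 0.5, 0.75)}
for p, (before, after) in results.items():
    print(p, before, after)  # 'after' should be the smaller value
```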

2) $-\ell^p$ and D3:

Theorem A.2: $-\ell^p$ satisfies

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof:



1/p 



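3) $\ell^2/\ell^1$ and D1:

Theorem A.3: $\ell^2/\ell^1$ satisfies

$$S([\,c_1 \ \ldots\ c_i - \alpha \ \ldots\ c_j + \alpha \ \ldots\,]) < S(c)$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof: As $\frac{\ell^2}{\ell^1} = \frac{\sqrt{\sum_j c_j^2}}{\sum_j c_j}$ we can restate the above as

$$\frac{\sqrt{\sum_{k \neq i,j} c_k^2 + (c_i - \alpha)^2 + (c_j + \alpha)^2}}{\sum_k c_k + \alpha - \alpha} < \frac{\sqrt{\sum_k c_k^2}}{\sum_k c_k}.$$

This simplifies to

$$(c_i - \alpha)^2 + (c_j + \alpha)^2 < c_i^2 + c_j^2$$

$$c_i^2 - 2c_i\alpha + \alpha^2 + c_j^2 + 2c_j\alpha + \alpha^2 < c_i^2 + c_j^2$$

$$c_j - c_i + \alpha < 0,$$

which we know is true as $0 < \alpha < \frac{c_i - c_j}{2}$.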



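4) $\ell^2/\ell^1$ and D3:

Theorem A.4: $\ell^2/\ell^1$ does not satisfy

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof:

To simplify matters we make the following substitutions

$$s_1 = \sum_j c_j, \qquad s_2 = \sum_j c_j^2 \qquad (6)$$

and note that $s_1^2 > s_2$. We now have

$$\frac{\sqrt{s_2 + 2\alpha s_1 + N\alpha^2}}{s_1 + N\alpha} < \frac{\sqrt{s_2}}{s_1}$$

$$s_1^2\,(s_2 + 2\alpha s_1 + N\alpha^2) < s_2\,(s_1^2 + 2 s_1 N\alpha + N^2\alpha^2)$$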



a < 



N 



S2 - Si 



2si XMsi"^ - S2 



which is false as (y ^^^^l^^ < which violates the condition 
a > 0. ■ 

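5) $\ell^2/\ell^1$ and P1:

Theorem A.5: $\ell^2/\ell^1$ satisfies $\forall i\ \exists \beta = \beta_i > 0$, such that $\forall \alpha > 0$:

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,]).$$

Proof: We make the following substitutions

$$s_1 = \sum_j c_j, \qquad s_2 = \sum_j c_j^2$$

and wish to show that the inequality above holds when $S$ is the
ratio $\ell^2/\ell^1$. Squaring both sides and cross-multiplying gives

$$\alpha > \frac{2 s_1 s_2 + 2\beta^2 c_i - 2\beta s_1^2 - 2 c_i s_1^2}{s_1^2 + 2 s_1 \beta - s_2 - 2\beta c_i}.$$

We want RHS $< 0$ and therefore want a $\beta$ such that

$$\frac{2 s_1 s_2 + 2\beta^2 c_i - 2\beta s_1^2 - 2 c_i s_1^2}{s_1^2 + 2 s_1 \beta - s_2 - 2\beta c_i} < 0.$$

As the denominator is always positive, we are only interested
in the numerator, that is, finding a $\beta$ such that

$$2 s_1 s_2 + 2\beta^2 c_i - 2\beta s_1^2 - 2 c_i s_1^2 < 0.$$

This is satisfied for $\beta = s_1$, since then

$$s_1 s_2 + s_1^2 c_i - s_1^3 - c_i s_1^2 < 0,$$

which is clearly true. ■

6) $-\tanh_{a,b}$ and D1:

Theorem A.6: $-\tanh_{a,b}$ satisfies

$$S([\,c_1 \ \ldots\ c_i - \alpha \ \ldots\ c_j + \alpha \ \ldots\,]) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof: Need to show that

$$-\tanh\left((a c_i - a\alpha)^b\right) - \tanh\left((a c_j + a\alpha)^b\right) < -\tanh\left((a c_i)^b\right) - \tanh\left((a c_j)^b\right).$$

Making the substitutions $x = a c_i$, $y = a c_j$ and $z = a\alpha$ we
get

$$\tanh\left((x - z)^b\right) + \tanh\left((y + z)^b\right) > \tanh\left(x^b\right) + \tanh\left(y^b\right)$$

with $x > y > 0$ and $0 < z < \frac{x - y}{2}$. Setting

$$f(z) = \left(\tanh\left((x - z)^b\right) - \tanh\left(x^b\right)\right) + \left(\tanh\left((y + z)^b\right) - \tanh\left(y^b\right)\right),$$

we use the mean value theorem of differential calculus to show
that

$$\tanh\left((x - z)^b\right) - \tanh\left(x^b\right) = -zb\left(1 - \tanh^2\left(\theta_1^b\right)\right)$$

$$\tanh\left((y + z)^b\right) - \tanh\left(y^b\right) = zb\left(1 - \tanh^2\left(\theta_2^b\right)\right)$$

where $x - z < \theta_1 < x$ and $y < \theta_2 < y + z$. However, because
$1 - \tanh^2(x^b)$ is strictly decreasing for $x > 0$ and $b > 0$, and
because $z < \frac{x - y}{2} \Rightarrow y + z < x - z$, it follows that

$$f(z) = zb\left[\left(1 - \tanh^2\left(\theta_2^b\right)\right) - \left(1 - \tanh^2\left(\theta_1^b\right)\right)\right] > 0.$$

7) $-\tanh_{a,b}$ and D3:

Theorem A.7: $-\tanh_{a,b}$ satisfies

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof: It is enough to show that $\frac{\partial S(\alpha + c)}{\partial \alpha} < 0$, as if the
derivative of the measure with respect to the parameter $\alpha$ is
negative then any $\alpha$ causes the measure to drop.

$$\frac{\partial}{\partial \alpha}\left[-\tanh\left((a\alpha + a c_j)^b\right)\right] = -\left(1 - \tanh^2\left((a\alpha + a c_j)^b\right)\right) b\,(a\alpha + a c_j)^{b-1}\, a < 0,$$

which is true as $a, b > 0$ and $\tanh\theta < 1$.

8) $-\log$ and D3:

Proof: $-\log$ satisfies

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0$$

as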



(1 + c? 



>0 



Which is true because 



1 + c? 



> l,a > 0. 



9) $\kappa_4$ and D2:

Theorem A.8: $\kappa_4$ satisfies

$$S(\alpha c) = S(c), \quad \forall \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof:



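10) $\kappa_4$ and D3:

Theorem A.9: $\kappa_4$ satisfies

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof: Set

$$f(\alpha) = \frac{\sum_i (c_i + \alpha)^4}{\left(\sum_i (c_i + \alpha)^2\right)^2}.$$

It follows that

$$\frac{df}{d\alpha} = \frac{4\left[\sum_i (c_i + \alpha)^3 \sum_i (c_i + \alpha)^2 - \sum_i (c_i + \alpha)^4 \sum_i (c_i + \alpha)\right]}{\left(\sum_i (c_i + \alpha)^2\right)^3}.$$

We can ignore the denominator as it is clearly positive. We
claim that $\frac{df}{d\alpha} < 0$ for $\alpha > 0$. This is because, for positive $x_i$,
it is always true that

$$\sum_i x_i^3 \sum_j x_j^2 - \sum_i x_i^4 \sum_j x_j < 0$$

as

$$\sum_i x_i^3 \sum_j x_j^2 - \sum_i x_i^4 \sum_j x_j = \sum_{i<j}\left[x_i^3 x_j^2 + x_j^3 x_i^2 - x_i^4 x_j - x_j^4 x_i\right]$$

$$= -\sum_{i<j} x_i x_j (x_i - x_j)^2 (x_i + x_j) < 0.$$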
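
The pairwise identity behind this sign argument can be verified exactly on integers. The sketch below assumes $\kappa_4 = \sum_i c_i^4 / (\sum_i c_i^2)^2$ as in $f(\alpha)$ above; the function names are illustrative.

```python
from itertools import combinations

def kappa4(c):
    """Kurtosis-style measure: sum of fourth powers over squared sum of squares."""
    return sum(x ** 4 for x in c) / sum(x ** 2 for x in c) ** 2

def derivative_numerator(x):
    """sum x^3 * sum x^2 - sum x^4 * sum x (the bracket in df/dalpha)."""
    return (sum(v ** 3 for v in x) * sum(v ** 2 for v in x)
            - sum(v ** 4 for v in x) * sum(x))

def pairwise_form(x):
    """-sum over pairs i<j of x_i x_j (x_i - x_j)^2 (x_i + x_j)."""
    return -sum(a * b * (a - b) ** 2 * (a + b) for a, b in combinations(x, 2))

c = [0, 1, 3, 5]
shifted = [v + 1 for v in c]        # rising tide with alpha = 1

lhs = derivative_numerator(shifted)
rhs = pairwise_form(shifted)
print(lhs, rhs)                      # equal and negative
print(kappa4(c), kappa4(shifted))    # the measure drops after the shift
```

Integer inputs keep both sides of the identity exact, so the comparison does not depend on floating-point tolerance.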



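11) $\kappa_4$ and P1:

Theorem A.10: $\kappa_4$ satisfies $\forall i\ \exists \beta = \beta_i > 0$ such that $\forall \alpha > 0$:

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,]).$$

Proof: Fix $i$ and make the substitution $\tilde{c}_i = c_i + \beta$. We
show that the derivative of the measure is positive and hence
the measure increases for any $\alpha$:

$$\frac{\partial}{\partial \alpha}\left[\frac{\sum_{j \neq i} c_j^4 + (\tilde{c}_i + \alpha)^4}{\left(\sum_{j \neq i} c_j^2 + (\tilde{c}_i + \alpha)^2\right)^2}\right] > 0.$$

The numerator of the derivative is, up to a positive factor,

$$(\tilde{c}_i + \alpha)^3\left(\sum_{j \neq i} c_j^2 + (\tilde{c}_i + \alpha)^2\right) - \left(\sum_{j \neq i} c_j^4 + (\tilde{c}_i + \alpha)^4\right)(\tilde{c}_i + \alpha) > 0.$$

Multiplying out and substituting back in for $\tilde{c}_i$ this becomes

$$c_i + \alpha + \beta > \sqrt{\frac{\sum_{j \neq i} c_j^4}{\sum_{j \neq i} c_j^2}}.$$

Clearly there exists a $\beta$ such that the above expression holds
true for all $\alpha > 0$. ■

12) $-\ell^p_-$ and P1:

Theorem A.11: $-\ell^p_-$ satisfies $\forall i\ \exists \beta = \beta_i > 0$, such that
$\forall \alpha > 0$:

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,]).$$

Proof: Without loss of generality we can change the
conditions slightly by replacing $p$ ($p < 0$) with $-p$ and
correspondingly update the constraint to $p > 0$.

$$-\sum_{j \neq i} c_j^{-p} - (c_i + \beta + \alpha)^{-p} > -\sum_{j \neq i} c_j^{-p} - (c_i + \beta)^{-p}$$

$$(c_i + \beta + \alpha)^{-p} < (c_i + \beta)^{-p}$$

$$\frac{1}{(c_i + \beta + \alpha)^p} < \frac{1}{(c_i + \beta)^p},$$

which is true if $\beta > 0$. ■

13) $u_\theta$ and D1:

Theorem A.12: $u_\theta$ does not satisfy

$$S([\,c_1 \ \ldots\ c_i - \alpha \ \ldots\ c_j + \alpha \ \ldots\,]) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof: For $\theta = .5$,

$$S([1, 2, 4, 9]) = .6667$$
$$S([1.1, 1.9, 4, 9]) = .7333.$$

The Robin Hood operation increased sparsity and hence $u_\theta$ does
not satisfy D1. ■

14) $u_\theta$ and D3:

Theorem A.13: $u_\theta$ does not satisfy

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof: The support of $c$ is $[c_{(1)}, c_{(N)}]$. Assume the support
of the $\lceil \theta N \rceil$ points that correspond to the minimum is
$[c_{(k)}, c_{(j)}]$. By adding a constant, $\alpha$, to each coefficient in the
distribution we shift the distribution to $c + \alpha$. Clearly, neither
of the two supports mentioned above changes:
$(c_{(j)} + \alpha) - (c_{(k)} + \alpha) = c_{(j)} - c_{(k)}$. Hence $u_\theta$ does not
satisfy D3. ■

15) $u_\theta$ and D4:

Theorem A.14: $u_\theta$ satisfies

$$S(c) = S(c\|c) = S(c\|c\|c) = S(c\|c\|\cdots\|c).$$

Proof: The support of $c$ is $[c_{(1)}, c_{(N)}]$. Assume the
support of the $\lceil \theta N \rceil$ points that correspond to the minimum
is $[c_{(k)}, c_{(j)}]$. The new set, $\{c\|c\}$, has $2\lceil N\theta \rceil$ points lying
between values $c_{(j)}$ and $c_{(k)}$, that is, neither of the previously
mentioned two supports has changed. This reasoning holds for
cloning the data more than once. Hence $u_\theta$ satisfies D4. ■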



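16) $u_\theta$ and P1:

Theorem A.15: $u_\theta$ satisfies $\forall i\ \exists \beta = \beta_i > 0$, such that $\forall \alpha > 0$:

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,]).$$

Proof: The support of $c$ is $[c_{(1)}, c_{(N)}]$. Without loss of
generality we focus on $c_{(N)}$ as the effect of adding sufficiently
large $\beta$ to any other coefficient will result in this coefficient
becoming the largest. We choose $\beta$ sufficiently large so that
$c_{(N)} + \beta$ is set sufficiently far apart from the other coefficients
for the support of the $\lceil \theta N \rceil$ points that correspond to the
minimum not to contain $c_{(N)}$. Consequently, the numerator
of the minimization term is a constant $K$ not depending on $\beta$
or $\alpha$. We can rewrite

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,])$$

as

$$1 - \frac{K}{c_{(N)} - c_{(1)} + \alpha + \beta} > 1 - \frac{K}{c_{(N)} - c_{(1)} + \beta},$$

which is clearly true and the proof is complete. ■

17) $u_\theta$ and P2:

Theorem A.16: $u_\theta$ does not satisfy

$$S(c\|0) > S(c).$$

Proof: Assume $c$ has total support $c_{(N)} - c_{(1)}$ and that the
$\lceil \theta N \rceil$ points corresponding to the minimum have support
$c_{(j)} - c_{(k)}$. If 0 lies within the range $c_{(j)} - c_{(k)}$, adding a 0 will
decrease the range to $c_{(j-1)} - c_{(k)}$ without increasing the total
support. ■

18) $H_G$ and D1:

Theorem A.17: $H_G$ satisfies

$$S([\,c_1 \ \ldots\ c_i - \alpha \ \ldots\ c_j + \alpha \ \ldots\,]) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof:

$$-\sum_{k \neq i,j} \ln c_k^2 - \ln (c_i - \alpha)^2 - \ln (c_j + \alpha)^2 < -\sum_k \ln c_k^2$$

$$-2\ln (c_i - \alpha) - 2\ln (c_j + \alpha) < -2\ln c_i - 2\ln c_j$$

$$(c_i - \alpha)(c_j + \alpha) > c_i c_j$$

$$c_i - c_j > \alpha,$$

which is clearly true. ■

19) Hoyer and D1:

Theorem A.18: Hoyer satisfies

$$S([\,c_1 \ \ldots\ c_i - \alpha \ \ldots\ c_j + \alpha \ \ldots\,]) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof:

$$\frac{\partial}{\partial \alpha}\left[\frac{\sqrt{N} - \dfrac{\sum_k c_k}{\sqrt{\sum_{k \neq i,j} c_k^2 + (c_i - \alpha)^2 + (c_j + \alpha)^2}}}{\sqrt{N} - 1}\right] < 0,$$

which is

$$\frac{\left(\sum_k c_k\right)(c_j - c_i + 2\alpha)}{(\sqrt{N} - 1)\left(\sum_{k \neq i,j} c_k^2 + (c_i - \alpha)^2 + (c_j + \alpha)^2\right)^{3/2}} < 0.$$

This is true as $(c_j - c_i + 2\alpha) < 0$.

20) Hoyer and D3:

Theorem A.19: Hoyer satisfies

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof:

$$\frac{\partial}{\partial \alpha}\left[\frac{\sqrt{N} - \dfrac{\sum_i (c_i + \alpha)}{\sqrt{\sum_i (c_i + \alpha)^2}}}{\sqrt{N} - 1}\right] < 0.$$

With the substitution

$$s_1 = \sum_{i=1}^{N} c_i, \qquad s_2 = \sum_{i=1}^{N} c_i^2$$

this becomes

$$(s_1 + N\alpha)^2 (s_2 + 2\alpha s_1 + N\alpha^2)^{-3/2} - N (s_2 + 2\alpha s_1 + N\alpha^2)^{-1/2} < 0,$$

which simplifies to

$$s_1^2 - N s_2 < 0.$$

We rewrite this as

$$N > \frac{s_1^2}{s_2} = \frac{\left(\sum_{i=1}^{N} c_i\right)^2}{\sum_{i=1}^{N} c_i^2},$$

which is true by Cauchy-Schwarz. ■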
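
The two Hoyer results above (D1 and D3) can be illustrated on the counter-example vectors. This is a sketch that assumes the Hoyer measure has the form $(\sqrt{N} - \ell^1/\ell^2)/(\sqrt{N} - 1)$; the function name `hoyer` is chosen here for illustration.

```python
import math

def hoyer(c):
    """(sqrt(N) - l1/l2) / (sqrt(N) - 1) for a vector with N > 1 entries."""
    n = len(c)
    l1 = sum(abs(x) for x in c)
    l2 = math.sqrt(sum(x * x for x in c))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

c = [0, 1, 3, 5]
robin_hood = [0, 2, 3, 4]            # alpha = 1 moved from the 5 to the 1
rising_tide = [x + 1 for x in c]     # alpha = 1 added to every entry

s1 = sum(c)
s2 = sum(x * x for x in c)

print(hoyer(c), hoyer(robin_hood), hoyer(rising_tide))
print(len(c) * s2 > s1 ** 2)         # the Cauchy-Schwarz condition N > s1^2 / s2
```

Both modified vectors should score lower than the original, in line with Theorems A.18 and A.19.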
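21) Hoyer and P1:

Theorem A.20: Hoyer satisfies $\forall i\ \exists \beta = \beta_i > 0$, such that
$\forall \alpha > 0$:

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,]).$$

Proof:

$$\frac{\partial}{\partial \alpha}\left[\frac{\sqrt{N} - \frac{\ell^1}{\ell^2}}{\sqrt{N} - 1}\right] = \frac{\partial}{\partial \alpha}\left[\frac{\sqrt{N} - \dfrac{\sum_j c_j + \alpha + \beta}{\sqrt{\sum_{j \neq i} c_j^2 + (c_i + \alpha + \beta)^2}}}{\sqrt{N} - 1}\right]$$

which is

$$\frac{\sum_{j \neq i} c_j\,(c_i + \alpha + \beta) - \sum_{j \neq i} c_j^2}{(\sqrt{N} - 1)\left(\sum_{j \neq i} c_j^2 + (c_i + \alpha + \beta)^2\right)^{3/2}}.$$

Clearly for sufficiently large $\beta$ the above quantity is $> 0$.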
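22) Gini and D1:

Theorem A.21: The Gini Index satisfies

$$S(c_1, \ldots, c_i - \alpha, \ldots, c_j + \alpha, \ldots) < S(c),$$

for all $\alpha, c_i, c_j$ such that $c_i > c_j$ and $0 < \alpha < \frac{c_i - c_j}{2}$.

Proof: The Gini Index of $c = [\,c_1\ c_2\ c_3\ \cdots\,]$ is
given by

$$S(c) = 1 - 2\sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1}\left(\frac{N - k + \frac{1}{2}}{N}\right) \qquad (7)$$

where $(k)$ denotes the new index after sorting from lowest to
highest, that is, $c_{(1)} \leq c_{(2)} \leq \cdots \leq c_{(N)}$.

Without loss of generality we can assume that the two
coefficients involved in the Robin Hood operation are $c_{(i)}$ and
$c_{(j)}$. After a Robin Hood operation is performed on $c$ we label
the resulting set of coefficients $d$ which are sorted using an
index which we denote $[\cdot]$, that is, $d_{[1]} \leq d_{[2]} \leq \cdots \leq d_{[N]}$.
Let us assume that the Robin Hood operation alters the sorted
ordering in that the new coefficient obtained by the subtraction
of $\alpha$ from $c_{(i)}$ has the new rank $i - n$, that is,

$$d_{[i-n]} = c_{(i)} - \alpha,$$

and the new coefficient obtained by the addition of $\alpha$ to $c_{(j)}$
has the new rank $j + m$, that is,

$$d_{[j+m]} = c_{(j)} + \alpha.$$

The correspondence between the coefficients of $c$ and $d$ is
shown in Fig. 6 and in mathematical terms is

$$d_{[k]} = c_{(k)} \quad \text{for} \quad 1 \leq k \leq j - 1$$
$$d_{[k]} = c_{(k+1)} \quad \text{for} \quad j \leq k \leq j + m - 1$$
$$d_{[k]} = c_{(j)} + \alpha \quad \text{for} \quad k = j + m$$
$$d_{[k]} = c_{(k)} \quad \text{for} \quad j + m + 1 \leq k \leq i - n - 1$$
$$d_{[k]} = c_{(i)} - \alpha \quad \text{for} \quad k = i - n$$
$$d_{[k]} = c_{(k-1)} \quad \text{for} \quad i - n + 1 \leq k \leq i$$
$$d_{[k]} = c_{(k)} \quad \text{for} \quad i + 1 \leq k \leq N.$$

We wish to show

$$S(c) > S(d).$$

Removing common terms and noting that $\|c\|_1 = \|d\|_1$ we
can simplify this to

$$\sum_{k \in A} c_{(k)}\left(N - k + \tfrac{1}{2}\right) < \sum_{k \in A} d_{[k]}\left(N - k + \tfrac{1}{2}\right),$$

where $A = \{j, j+1, \ldots, j+m, i-n, i-n+1, \ldots, i\}$. Using
the correspondence above we can express the coefficients of
$d$ in terms of the coefficients of $c$. We then get

$$\sum_{k=1}^{m} c_{(j+k)}\left[\left(N - j - k + 1 + \tfrac{1}{2}\right) - \left(N - j - k + \tfrac{1}{2}\right)\right]$$
$$+\ c_{(j)}\left[\left(N - j - m + \tfrac{1}{2}\right) - \left(N - j + \tfrac{1}{2}\right)\right]$$
$$+\ \sum_{k=1}^{n} c_{(i-k)}\left[\left(N - i + k - 1 + \tfrac{1}{2}\right) - \left(N - i + k + \tfrac{1}{2}\right)\right]$$
$$+\ c_{(i)}\left[\left(N - i + n + \tfrac{1}{2}\right) - \left(N - i + \tfrac{1}{2}\right)\right]$$
$$+\ \alpha\left[\left(N - j - m + \tfrac{1}{2}\right) - \left(N - i + n + \tfrac{1}{2}\right)\right] > 0,$$

which becomes

$$\sum_{k=1}^{m}\left(c_{(j+k)} - c_{(j)}\right) + \sum_{k=1}^{n}\left(c_{(i)} - c_{(i-k)}\right) + \alpha\left((i - n) - (j + m)\right) > 0.$$

This is true as the two summations are positive as the negative
component has a lower sorted index than the positive and is
hence smaller and the last term is positive due to the condition
on $\alpha$. ■

23) Gini and D2:

Theorem A.22: The Gini Index satisfies

$$S(\alpha c) = S(c), \quad \forall \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof:

$$S(\alpha c) = 1 - 2\sum_{k=1}^{N} \frac{\alpha c_{(k)}}{\|\alpha c\|_1}\left(\frac{N - k + \frac{1}{2}}{N}\right) = 1 - 2\sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1}\left(\frac{N - k + \frac{1}{2}}{N}\right) = S(c).$$

24) Gini and D3:

Theorem A.23: The Gini Index satisfies

$$S(\alpha + c) < S(c), \quad \alpha \in \mathbb{R},\ \alpha > 0.$$

Proof: Rewriting $S(\alpha + c) < S(c)$ and making the
substitution

$$f(k) = \frac{N - k + \frac{1}{2}}{N}$$

we get the following:

$$\sum_{k=1}^{N} \frac{c_{(k)}}{\|c + \alpha\|_1} f(k) + \frac{\alpha}{\|c + \alpha\|_1}\sum_{k=1}^{N} f(k) > \sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1} f(k).$$

Using $\|c + \alpha\|_1 = \|c\|_1 + N\alpha$ this becomes

$$\frac{\alpha}{\|c + \alpha\|_1}\left(\sum_{k=1}^{N} f(k) - N\sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1} f(k)\right) > 0.$$

This is clearly true for $N > 1$.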
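
To make the Gini results concrete, the following sketch evaluates the index on the counter-example pairs for D1, D2 and D3. It assumes the index has the form $S(c) = 1 - 2\sum_{k=1}^{N} \frac{c_{(k)}}{\|c\|_1}\frac{N - k + 1/2}{N}$ for non-negative coefficients sorted in ascending order; the function name `gini` is illustrative.

```python
import math

def gini(c):
    """Gini Index of a non-negative vector with a non-zero sum."""
    s = sorted(c)                     # c_(1) <= c_(2) <= ... <= c_(N)
    n = len(s)
    total = sum(s)                    # l1 norm for non-negative entries
    return 1 - 2 * sum(x / total * (n - k + 0.5) / n
                       for k, x in enumerate(s, start=1))

base = [0, 1, 3, 5]
d1_pair = (gini(base), gini([0, 2, 3, 4]))            # Robin Hood
d2_pair = (gini(base), gini([0, 2, 6, 10]))           # scaling by 2
d3_pair = (gini([1, 3, 5]), gini([1.5, 3.5, 5.5]))    # rising tide

print(d1_pair)   # second value is lower
print(d2_pair)   # values agree
print(d3_pair)   # second value is lower
```

Sorting inside the function means the caller does not need to track how a Robin Hood step reorders the coefficients.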
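Fig. 6. The mapping between a vector before and after a Robin Hood operation: the sorted coefficients $c_{(1)}, \ldots, c_{(N)}$ (top row) against $d_{[1]}, \ldots, d_{[N]}$ (bottom row). This is used in Proof B.22.

25) Gini and D4:

Theorem A.24: The Gini Index satisfies

$$S(c) = S(c\|c) = S(c\|c\|c) = S(c\|c\|\cdots\|c).$$

Proof: We clone $c$ $M$ times to get the vector $d$ which has
length $MN$:

$$S(c\|c\|\cdots\|c) = S(d)$$

$$= 1 - 2\sum_{k=1}^{MN} \frac{d_{(k)}}{\|d\|_1}\left(\frac{MN - k + \frac{1}{2}}{MN}\right)$$

$$= 1 - 2\sum_{i=1}^{N}\sum_{j=1}^{M} \frac{c_{(i)}}{M\|c\|_1}\left(\frac{MN - (Mi - M + j) + \frac{1}{2}}{MN}\right)$$

$$= 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{M\|c\|_1}\sum_{j=1}^{M}\left(\frac{MN - Mi + M - j + \frac{1}{2}}{MN}\right)$$

$$= 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{\|c\|_1}\left(\frac{M^2 N - M^2 i + M^2 - \frac{M(M+1)}{2} + \frac{M}{2}}{M^2 N}\right)$$

$$= 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{\|c\|_1}\left(\frac{M^2 N - M^2 i + M^2 - \frac{M^2}{2} - \frac{M}{2} + \frac{M}{2}}{M^2 N}\right)$$

$$= 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{\|c\|_1}\left(\frac{M^2 N - M^2 i + \frac{M^2}{2}}{M^2 N}\right)$$

$$= 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{\|c\|_1}\left(\frac{N - i + \frac{1}{2}}{N}\right)$$

$$= S(c).$$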
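
The key algebraic step in the cloning proof is the inner sum over the $M$ copies of each coefficient, $\sum_{j=1}^{M}\left(MN - (Mi - M + j) + \tfrac{1}{2}\right) = M^2 N - M^2 i + \tfrac{M^2}{2}$. The sketch below checks that step with exact rational arithmetic over a small grid of values; the function names are illustrative, and a finite check of this kind supports the algebra without replacing it.

```python
from fractions import Fraction

def inner_sum(M, N, i):
    """Sum over j = 1..M of (MN - (Mi - M + j) + 1/2), computed term by term."""
    half = Fraction(1, 2)
    return sum(M * N - (M * i - M + j) + half for j in range(1, M + 1))

def closed_form(M, N, i):
    """M^2 N - M^2 i + M^2 / 2, the simplified form used in the proof."""
    return M * M * N - M * M * i + Fraction(M * M, 2)

# Compare both forms for several clone counts M, lengths N and ranks i.
mismatches = [(M, N, i)
              for M in range(1, 6)
              for N in range(1, 6)
              for i in range(1, N + 1)
              if inner_sum(M, N, i) != closed_form(M, N, i)]
print(mismatches)  # an empty list means the two forms agree on the grid
```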



26) Gini and P1:

Theorem A.25: The Gini Index satisfies $\forall i\ \exists \beta = \beta_i > 0$,
such that $\forall \alpha > 0$:



We can simplify the above to 



N 



1 



ll^lli +/3 



/3 



2Af(l|cl|i +/3) 

JV 



^C{,)(7V-i) > 0. 

i = l 

Hence, the Gini Index satisfies PI. ■ 
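27) Gini and P2:

Theorem A.26: The Gini Index satisfies

$$S(c\|0) > S(c).$$

Proof: Let us define

$$d = c\|0 = [\,c_1\ c_2\ c_3\ \cdots\ c_N\ 0\,]$$

and we note that $\|d\|_1 = \|c\|_1$. Without loss of generality
we assign the lowest rank to the added coefficient 0, that is,
$d_{N+1} = d_{(1)}$. We can now make the assertion $d_{(i+1)} = c_{(i)}$,
yielding

$$S(d) = 1 - 2\sum_{k=1}^{N+1} \frac{d_{(k)}}{\|d\|_1}\left(\frac{N + 1 - k + \frac{1}{2}}{N + 1}\right).$$

Making the substitution $i = k - 1$ we get

$$S(d) = 1 - 2\sum_{i=0}^{N} \frac{d_{(i+1)}}{\|d\|_1}\left(\frac{N - i + \frac{1}{2}}{N + 1}\right)$$

$$= 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{\|c\|_1}\left(\frac{N - i + \frac{1}{2}}{N + 1}\right)$$

$$> 1 - 2\sum_{i=1}^{N} \frac{c_{(i)}}{\|c\|_1}\left(\frac{N - i + \frac{1}{2}}{N}\right)$$

$$= S(c).$$

For Theorem A.25 (Gini and P1) the inequality to be shown is

$$S([\,c_1 \ \ldots\ c_i + \beta + \alpha \ \ldots\,]) > S([\,c_1 \ \ldots\ c_i + \beta \ \ldots\,]).$$

Proof: We use the following notation.

$$c = \{c_{(1)}, c_{(2)}, \ldots, c_{(N)}\}$$

Without loss of generality we have chosen to perform the
operation on $c_{(N)}$ as $\beta$ can absorb the additive value needed
to change any of the $c_{(i)}$ to $c_{(N)}$.
We wish to show that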